Bioinformatics Advances — Latest Matching Preprints

1

Discovering conserved regulatory modules in predicted gene regulatory networks across species

Zhang, J.; Heath, L. S.

2026-05-16 systems biology 10.64898/2026.05.15.725337 medRxiv

Top 0.1%

26.5%

The discovery of conserved regulatory motifs across different species is a fundamental challenge in systems biology, especially considering the noisy and incomplete nature of predicted gene regulatory networks (GRNs) and the intractability of the underlying graph alignment problem. Traditional network alignment methods frequently enforce one-to-one node mappings or strict topological isomorphism, which fail to accommodate the many-to-many orthology mappings caused by evolutionary gene duplication. Consequently, strict constraints often yield highly fragmented topological islands rather than cohesive functional modules. In this work, we propose a relaxed topological alignment algorithm designed to extract conserved regulatory structures from cross-species GRNs. We formulate the discovery process as a multi-objective optimization problem that balances sequence homology, functional coherence, and a normalized topological consensus. To navigate the exponentially scaling search space, we introduce a greedy seed-and-extend heuristic bounded by a dynamic ε-stopping condition, which evaluates marginal objective gains to prevent functional dilution. We validate our algorithm using time-series transcriptomic data from Arabidopsis thaliana, Zea mays, and Sorghum bicolor focused on drought and developmental stress responses. While a strict topological baseline extracted only fragmented subgraphs limited to 51 homologous tuples, our relaxed heuristic successfully converged on a highly connected 444-tuple module. The resulting topology effectively links strictly conserved upstream transcription factors to their highly duplicated, species-specific downstream pathways. Our algorithm provides a robust, scalable computational methodology for identifying core regulatory logic across complex biological systems, facilitating the translation of conserved network architectures among multiple species.
Author summary: Identifying shared regulatory mechanisms across diverse species is essential for understanding how complex biological systems evolve and adapt. However, traditional computer algorithms struggle to align these biological networks because evolution frequently duplicates genes, breaking simple one-to-one comparisons and producing highly fragmented results. To overcome this limitation, we developed a relaxed cross-species network alignment algorithm. Instead of demanding perfectly identical network shapes, our approach dynamically balances genetic sequence similarity, network structure, and biological function. We demonstrated the performance of our algorithm using plant drought-stress networks as a case study. While strict methods only found tiny, disconnected network fragments, our algorithm uncovered a functionally coherent, interconnected regulatory module across three distinct species. We discovered that while upstream command genes remain strictly conserved, they regulate highly customized, species-specific execution pathways downstream. Ultimately, our framework provides a scalable, species-agnostic method to decode complex systems, allowing researchers to translate conserved biological logic across diverse genomes.
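
The greedy seed-and-extend idea with an ε-stopping rule, as described in the abstract, might look roughly like the sketch below. The function name `seed_and_extend`, the toy objective, and the fixed (rather than dynamic) `epsilon` are all assumptions made for illustration; the abstract does not give the authors' actual objective or implementation.

```python
from typing import Callable, Hashable, Iterable, List, Set


def seed_and_extend(
    seed: Hashable,
    candidates: Iterable[Hashable],
    objective: Callable[[Set[Hashable]], float],
    epsilon: float,
) -> List[Hashable]:
    """Grow a module from `seed`, adding the best-scoring candidate each round.

    Extension stops once the best marginal gain in `objective` falls below
    `epsilon` (a fixed threshold here; a hypothetical stand-in for the
    dynamic stopping condition mentioned in the abstract).
    """
    module: List[Hashable] = [seed]
    remaining = [c for c in candidates if c != seed]
    current = objective(set(module))
    while remaining:
        # Marginal gain of adding each remaining candidate to the module.
        gains = [
            (objective(set(module) | {c}) - current, i)
            for i, c in enumerate(remaining)
        ]
        best_gain, best_idx = max(gains)
        if best_gain < epsilon:
            break  # further extension would dilute the module
        module.append(remaining.pop(best_idx))
        current += best_gain
    return module


# Toy data: each homologous tuple contributes a fixed, made-up score.
scores = {"a": 1.0, "b": 0.9, "c": 0.2, "d": 0.05}


def toy_objective(members: Set[Hashable]) -> float:
    return sum(scores[m] for m in members)


module = seed_and_extend("a", scores, toy_objective, epsilon=0.1)
print(module)
```

In this toy run, tuple "d" is left out because its marginal gain (0.05) is below the threshold, which is the behaviour the stopping rule is meant to provide.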

2

MethylCurate: Tool For Dataset Curation and Epigenetic Aging Clock Evaluation

Edwards, T. A.; Shen, L.; Long, Q.

2026-05-14 bioinformatics 10.64898/2026.05.11.723515 medRxiv

Top 0.1%

14.9%

Summary: DNA methylation datasets from public repositories such as NCBI Gene Expression Omnibus are central to the development and evaluation of epigenetic aging clocks, yet existing resources and tools do not fully resolve the bottlenecks of dataset retrieval and metadata harmonization. Current benchmarking frameworks often rely on static curated collections, support only a subset of available Gene Expression Omnibus studies, focus on specific tissues, or require substantial manual intervention when metadata fields and supplementary files are inconsistently structured across studies. We developed MethylCurate, an agentic AI framework that addresses these limitations by automating the retrieval of DNA methylation datasets from the Gene Expression Omnibus, harmonizing heterogeneous metadata, mapping datasets to a unified format, and enabling scalable evaluation of epigenetic aging clocks through an integrated, dialogue-driven workflow.
Availability and Implementation: MethylCurate is implemented in Python and combines deterministic modules for Gene Expression Omnibus dataset retrieval, quality control, and clock evaluation with large language model-assisted agents for metadata extraction, metadata harmonization, and DNA methylation data parsing. Source code, documentation, and example workflows are available at: https://github.com/Travyse/methylcurate
Contact: travyse.edwards@pennmedicine.upenn.edu
Supplementary Information: Supplementary data are available at Bioinformatics online.
Graphical Abstract: MethylCurate is an agentic-AI framework that converts user-specified NCBI Gene Expression Omnibus DNA methylation datasets into standardized metadata, beta matrices, artifacts, logs, and aging clock benchmarking outputs through automated retrieval, quality control, metadata extraction, harmonization, and evaluation workflows. Figure generated with Biorender.
Key Messages:
- Automated curation of DNA methylation datasets from the Gene Expression Omnibus.
- Standardized preprocessing and metadata harmonization.
- Integrated benchmarking of epigenetic aging clocks.

3

Uncertainty-aware graph representation learning with positive-unlabeled classification for biomarker discovery in peripheral artery disease

Ayyalasomayajula, V. S. R. K.; Senders, M. L.; Wolterink, J. M.; Yeung, K. K.

2026-05-13 systems biology 10.64898/2026.05.08.723757 medRxiv

Top 0.1%

14.3%

Peripheral artery disease (PAD) is a complex vascular disorder characterized by heterogeneous molecular mechanisms and incomplete functional annotation, limiting systematic biomarker discovery. Network-based learning approaches provide a powerful framework for disease gene prioritization; however, most existing methods produce overconfident predictions without explicitly accounting for model uncertainty or structural novelty. Here, we present an uncertainty-aware framework for PAD biomarker discovery that integrates unsupervised graph representation learning, positive-unlabeled (PU) classification, ensemble prediction, and mechanistic explainability. Node embeddings were learned using multiple unsupervised graph neural network (GNN) objectives and combined with heterogeneous classifiers to generate ensemble-averaged probability estimates and epistemic uncertainty. By jointly modeling predictive confidence and embedding-space novelty, we stratified candidates into high-confidence rediscoveries and structurally novel hypotheses under explicit uncertainty control. Across eight embedding objectives and five classifiers, ensemble aggregation produced stable, well-calibrated predictions and enabled prioritization of 100 candidate PAD-associated proteins. Probability-heavy candidates clustered tightly with known PAD proteins and were enriched for established vascular and hemostatic pathways, including extracellular matrix organization, integrin signaling, coagulation, and fibrinolysis. In contrast, novelty-heavy candidates occupied distinct embedding-space regions and partitioned into multiple coherent clusters enriched for upstream regulatory and signaling processes, including G protein-coupled receptor, ephrin receptor, kinase-driven, and NF-κB-associated pathways. Five-fold cross-validated comparison with established PU learning baselines demonstrated consistent improvement across all evaluation metrics (AUC 0.916 ± 0.019 vs. 0.821 ± 0.030 for the best baseline), and external validity was confirmed by significant enrichment of top candidates for related cardiovascular disease annotations (5.7x above background). Together, these results demonstrate that integrating uncertainty, novelty, and explainability enables calibrated and biologically grounded biomarker prioritization, with broad applicability to PAD and other complex diseases.
Author summary: Peripheral artery disease affects millions of people worldwide but remains underdiagnosed, partly because we lack reliable molecular markers to detect it early. In this study, we developed a computational framework that uses protein interaction network data to predict which proteins may be involved in PAD, even when we only know a small number of confirmed disease-associated proteins. Our approach combines graph neural network embeddings with a machine learning technique called positive-unlabeled learning, which is specifically designed for situations where you have confirmed positives but no confirmed negatives. We also quantify how confident the model is in each prediction and identify candidates that are genuinely novel compared to what is already known. Tested against established methods, our framework consistently found more known disease proteins in cross-validated evaluation. The candidates we identified map to biologically coherent pathways relevant to vascular disease, and our top predictions are enriched for proteins associated with related cardiovascular conditions, providing external validation. This work provides a principled and transparent approach to biomarker discovery that could be applied to other complex diseases with limited molecular annotations.
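
As a rough illustration of ensemble-averaged probabilities paired with an uncertainty estimate, the sketch below averages per-protein predictions across ensemble members and uses their standard deviation as a disagreement measure. Using the standard deviation as the epistemic-uncertainty proxy is an assumption here; the abstract does not state how uncertainty was computed, and the protein names and probabilities are invented.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def ensemble_summary(
    predictions: Dict[str, List[float]],
) -> Dict[str, Tuple[float, float]]:
    """Map each protein to (mean probability, std across ensemble members)."""
    return {
        protein: (mean(probs), pstdev(probs))
        for protein, probs in predictions.items()
    }


# Hypothetical probabilities from four embedding/classifier combinations.
preds = {
    "PROT_A": [0.90, 0.92, 0.88, 0.90],  # members agree: low uncertainty
    "PROT_B": [0.95, 0.40, 0.85, 0.20],  # members disagree: high uncertainty
}

summary = ensemble_summary(preds)
for protein, (prob, spread) in summary.items():
    print(f"{protein}: mean={prob:.3f} std={spread:.3f}")
```

A candidate with a high mean and a low spread would correspond to a confident prediction, while a high spread flags a candidate whose score depends strongly on the choice of embedding or classifier.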

4

MechAInistic: An LLM-guided Multi-Agent System for Reasoning over Genome-Scale Constraint-Based Metabolic Models

Loecker, J.; Pujara, N.; Bryant, W.; Puniya, B. L.; Packrisamy, P.; Hamed, A.; Helikar, T.

2026-05-13 systems biology 10.64898/2026.05.11.723319 medRxiv

Top 0.1%

12.4%

Constraint-based metabolic modeling is a powerful way to study the mechanistic basis of cellular states and disease, but effective use demands substantial computational expertise and careful coordination of multi-step analyses. We developed MechAInistic to lower this barrier, enabling researchers to ask complex biological questions in natural language. MechAInistic is a multi-agent system harnessing large language models organized around an Architect-Reviewer pattern that converts a natural-language question into an executable, model-grounded workflow and produces a structured report. It supports pathway comparison, perturbation analysis, drug-target exploration, and literature interpretation across healthy and disease paired states. We evaluated MechAInistic's therapeutic hypothesis generation using two immune-cell use-cases. For rheumatoid arthritis/healthy Naive B models, it identified mitochondrial metabolic rewiring and nominated Devimistat/CPI-613 as an investigational OGDH-centered hypothesis. In CD4+ Th17 multiple sclerosis/healthy models, the workflow identified NADP-dependent isocitrate dehydrogenase as the optimal target and proposed Ivosidenib as an FDA-approved repurposing candidate.

5

BAT: an integrated pipeline for gene tree construction, annotation, and functional inference

Sheppard, B. D.; Behnken, B.; Steinbrenner, A.

2026-05-12 bioinformatics 10.64898/2026.05.07.721474 medRxiv

Top 0.1%

10.3%

Gene family functional exploration often requires analyzing motifs, domains, and associated datasets (e.g., gene expression) in the phylogenetic context of a gene tree. As genomic resources become more abundant, local pipelines are needed to analyze gene families of interest with project-specific resources. Here we present BLAST-Align-Tree (BAT), a bioinformatic pipeline for automated gene family phylogeny construction and annotation to enable gene tree exploration. BAT combines a BLAST search of local genome databases with a robust and flexible gene tree construction pipeline that enables multiple modes of annotation. Output visualizations display experimental datasets, custom regex-specified amino acid motifs, and protein HMM domain annotations. For flexibility, BAT runs locally and is independent of pre-existing databases, allowing the easy incorporation of custom genomes and datasets. Three primary case studies described here demonstrate the utility of BAT for inferring the function of homologs and orthologs within characterized gene families. BAT is suitable for fine-scale phylogenomic analysis of gene families across the tree of life, and default genomes available on installation span model eukaryotes.
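
To make the idea of regex-specified amino acid motifs concrete, here is a minimal sketch that scans protein sequences with a regular expression and reports 1-based hit positions. It is not BAT's code: the function `find_motifs` and the toy sequences are hypothetical, and the pattern is the widely used N-glycosylation sequon (N, any residue except P, S or T, any residue except P).

```python
import re
from typing import Dict, List, Tuple


def find_motifs(
    sequences: Dict[str, str], pattern: str
) -> Dict[str, List[Tuple[int, str]]]:
    """Return, per sequence, a list of (1-based start, matched text) hits.

    Note: re.finditer reports non-overlapping matches only.
    """
    regex = re.compile(pattern)
    return {
        name: [(m.start() + 1, m.group()) for m in regex.finditer(seq)]
        for name, seq in sequences.items()
    }


# Toy sequences, invented for illustration.
toy_sequences = {
    "homolog_1": "MKNGTAPLNPSA",  # "NGTA" matches; "NPSA" is excluded by the P
    "homolog_2": "MAAAA",
}

hits = find_motifs(toy_sequences, r"N[^P][ST][^P]")
print(hits)
```

Hits in this form could then be attached to the tips of a gene tree as annotations, which is the kind of display the abstract describes.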

6

Input design for unsupervised cross-national branded food database alignment using large language models

Nakagawa, S.; Yamamoto, A.

2026-05-25 nutrition 10.64898/2026.05.23.26353945 medRxiv

Top 0.1%

10.3%

Cross-national alignment of branded food databases is essential for international nutritional epidemiology but lacks standardized methods. Existing approaches (food ontologies, domain-specific fine-tuned language models, and manual expert mapping) either require substantial infrastructure or do not scale to thousands of items. We propose an unsupervised evaluation framework for large language model (LLM)-based food database alignment that requires no ground-truth labels. Using the Japan Branded Food Database (JBFD; 9,519 items, 71 mid-level categories) and USDA FoodData Central (448 categories) as a case study, we introduce two complementary metrics: weighted centroid distance (nutritional proximity between matched category pairs) and dominant category share (structural consistency of category-level assignments). We then conducted a systematic ablation study across eight input conditions (A-H), varying combinations of product name, nutrient profile, and semantic category label. Results showed that nutrient-only inputs yielded poor structural consistency despite low centroid distances, while semantic category labels achieved the highest dominant category share (89.3%) but introduced circularity due to their LLM-derived origin. Among circularity-free conditions, product name combined with minimal nutrient information (energy, protein, salt; condition E) achieved the best balance of centroid distance (0.471) and dominant category share (65.8%). Model comparison across Claude Haiku, Sonnet, and Opus confirmed that NO_MATCH rates were consistent across model sizes (12-14%), suggesting that prompt design contributes more to alignment quality than model scale. These findings provide practical guidance for input design in LLM-based food database alignment without ground-truth annotation.
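
One plausible reading of the dominant category share metric is sketched below: for each source category, find the target category that receives most of its items, then report the fraction of all matched items that fall into those dominant targets. This item-weighted definition is an assumption, since the abstract does not give the formula, and the category names are invented.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, Tuple


def dominant_category_share(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of items assigned to their source category's top target.

    `pairs` holds one (source_category, matched_target_category) per item.
    """
    by_source: DefaultDict[str, Counter] = defaultdict(Counter)
    for source_cat, target_cat in pairs:
        by_source[source_cat][target_cat] += 1
    total = sum(sum(counts.values()) for counts in by_source.values())
    if total == 0:
        raise ValueError("no matched items supplied")
    # Count of items landing in the most frequent target of each source.
    dominant = sum(
        counts.most_common(1)[0][1] for counts in by_source.values()
    )
    return dominant / total


# Toy assignments (hypothetical categories).
toy_pairs = (
    [("noodles", "Pasta")] * 3
    + [("noodles", "Soups")]
    + [("tea", "Beverages")] * 2
)

share = dominant_category_share(toy_pairs)
print(f"dominant category share: {share:.3f}")
```

A value near 1 means items from the same source category are mapped consistently, which is the structural consistency the metric is meant to capture.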

7

Redesign selective protein binders using contrastive decoding

Xie, Z.; Xu, J.

2026-05-13 bioinformatics 10.64898/2026.05.09.722041 medRxiv

Top 0.2%

9.3%

Motivation: Fixed-backbone sequence design methods such as ProteinMPNN operate on backbone coordinates alone and cannot represent target side-chains at the binding interface. Their decoding algorithm also lacks a mechanism to balance binding affinity and folding stability or to improve selectivity against structurally similar off-targets. These gaps limit the computational design of protein binders with high affinity and specificity.
Results: We present RedNet, a multiscale graph neural network that encodes side-chain information of the binding target. We further develop a contrastive decoding algorithm, motivated by the thermodynamic decomposition of binding free energy, that addresses two objectives: (1) balancing binding affinity and folding stability, and (2) improving selectivity against structurally similar off-targets. RedNet reaches 43% native sequence recovery on heterodimers, compared with 37% for ProteinMPNN and 33% for ESM-IF. With contrastive decoding, it matches native-sequence co-folding success (68%) on high-confidence AlphaFold3 targets, exceeding ProteinMPNN (59%) and ESM-IF (61%). On a new benchmark of structurally similar on-/off-target pairs, RedNet with contrastive decoding reaches 64.8% energetic selectivity, ahead of PiFold (55.6%), ProteinMPNN (53.7%), and ESM-IF (53.7%).
Availability: Source code and datasets are released at https://github.com/zw2x/rednet_public.
Contact: jinbo.xu@gmail.com

8

Machine learning-based prediction of memory requirements for metagenomic assembly in high-performance computing environments

Zierep, P. F.; Faack, S.; Beracochea, M.; Sanchez, S.; Batut, B.; Finn, R. D.; Gruening, B. A.

2026-05-13 microbiology 10.64898/2026.05.12.724571 medRxiv

Top 0.2%

8.9%

Metagenomic assembly can be a computationally intensive step in microbiome analysis, with memory requirements that vary widely depending on input data characteristics. In workflow systems like Galaxy and large-scale platforms like MGnify, which run thousands of heterogeneous jobs, inaccurate memory allocation drives job failures and costly retries when underestimated, and reduces throughput when overestimated. Current approaches rely primarily on heuristic rules based on input file size or sample metadata, which often fail to generalize across diverse datasets. In this study, we present a machine learning-based framework for predicting memory requirements of metagenomic assembly using metaSPAdes. We analyzed 300 assembly jobs from diverse biomes and evaluated 18 predictive models using combinations of input file size, biome classification, and sequence-derived k-mer features. K-mer profiles were computed from raw sequencing data and summarized into statistical descriptors capturing sequence complexity and diversity. Model performance was assessed using both conventional regression metrics and a production-oriented cost function that accounts for retry policies and resource waste in high-performance computing environments. Our results show that machine learning models can outperform commonly used heuristics. In particular, models incorporating biome information achieved the best performance and can be tuned to favor conservative predictions that reduce job failure rates. Simpler models based solely on input file size also performed competitively, offering a practical alternative for systems with limited feature availability. When evaluated under realistic workload distributions, predictive approaches reduced total memory waste by several million gigabyte-hours per 1,000 jobs compared to static allocation strategies. These findings demonstrate that data-driven resource prediction can substantially improve efficiency in metagenomic workflows. 
The proposed framework is adaptable to different computational environments and provides a foundation for integrating predictive resource allocation into large-scale bioinformatics platforms beyond Galaxy.
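
A production-oriented cost function of the kind the abstract mentions can be sketched as follows. The retry policy (multiply the allocation by a fixed factor after each failure) and the simplification that a failed attempt runs for the full job duration are assumptions for illustration, not details from the study.

```python
def memory_waste_gb_hours(
    predicted_gb: float,
    actual_gb: float,
    runtime_h: float,
    retry_factor: float = 2.0,
    max_retries: int = 5,
) -> float:
    """Wasted memory (GB-hours) for one job under a simple retry policy.

    Assumptions (hypothetical policy):
      * an attempt fails if its allocation is below the job's peak memory,
      * a failed attempt wastes its whole allocation for `runtime_h` hours,
      * each retry multiplies the allocation by `retry_factor`,
      * a successful attempt wastes only the unused headroom.
    """
    allocated = predicted_gb
    waste = 0.0
    for _ in range(max_retries + 1):
        if allocated >= actual_gb:
            return waste + (allocated - actual_gb) * runtime_h
        waste += allocated * runtime_h  # failed attempt: all of it is wasted
        allocated *= retry_factor
    raise RuntimeError("job did not fit in memory within the retry budget")


# Under-prediction: 100 GB fails, the 200 GB retry succeeds.
under = memory_waste_gb_hours(predicted_gb=100, actual_gb=150, runtime_h=2)
# Over-prediction: 200 GB succeeds first time with 50 GB of headroom.
over = memory_waste_gb_hours(predicted_gb=200, actual_gb=150, runtime_h=2)
print(under, over)
```

Under these assumptions an under-prediction costs more than an equally sized over-prediction, which is why a model tuned towards conservative estimates can lower the overall cost.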

9

Ensemble kinetic modelling links residual enzyme activity to clinical symptoms in mitochondrial β-oxidation defects

Odendaal, C.; Krebs, O.; Bakker, B. M.

2026-05-08 systems biology 10.64898/2026.05.05.722902 medRxiv

Top 0.2%

8.8%

The mitochondrial fatty acid β-oxidation (mFAO) is an important source of energy when carbohydrate stores are depleted. It is also involved in many diseases, including inherited fatty-acid oxidation deficiencies (mFAODs). Patients with the same genetic variant often present with clinically heterogeneous phenotypes, but the mechanisms contributing to this heterogeneity are poorly understood. To investigate the underlying pathophysiology of different mFAODs, we constructed a computational model of mFAO in human liver, based on experimentally determined enzyme kinetics. A recognised, but seldom addressed challenge in metabolic modelling is the substantial uncertainty about kinetic parameter values. Whereas experimental values of some mFAO parameters are quite reproducible, others vary by up to four orders of magnitude between different reports. To address this, we generated an ensemble of kinetic models, each with the same reaction stoichiometry and rate equations, but different kinetic parameters, sampled from distributions of literature-derived values. We also comprehensively report these values and the arguments based on which they were evaluated. The resulting models were validated against available flux data, yielding a final ensemble of 51 valid models. These models recapitulate recent findings about the accumulation of medium-chain acyl-CoAs and the concomitant depletion of free CoA (CoASH) in medium-chain acyl-CoA dehydrogenase deficiency. We applied the ensemble to a set of known mFAODs, separating them into long-chain (LC-) and short-/medium-chain (S/MC-)mFAODs. The residual activity at which clinical symptoms are known to occur corresponded well with the residual activity in the model at which pathway flux was significantly decreased in LC-mFAODs. Residual activity in S/MC-mFAODs correlated less strongly with pathway flux, but these deficiencies did show a combined flux- and CoASH-reduction effect. This comparison is of importance to researchers and clinicians, as it identifies possible ways in which insights about one mFAOD may be applied to another based on shared biochemical properties.
Author Summary: When building computer models of metabolic pathways, it is typical to take the "best" experimental data and use that as input into the model. However, especially when working with human cells, ethical and practical constraints often mean that even the "best" experimental data is still subject to substantial uncertainty. We explicitly modelled the uncertainty about the inner workings of fat burning (fatty acid oxidation). The resulting model is known as an "ensemble". The ensemble predicts ranges instead of single outcomes, allowing us to assess the confidence level of our predictions. We assess a set of inherited diseases - enzyme deficiencies - simulating them at different levels of severity with the ensemble. We find that the model does a good job of predicting the severity of the deficiencies at which symptoms will occur. It also allows us to identify a key difference between two subgroups within this group of deficiencies: long-chain and medium-/short-chain, depending on the size of the fats being metabolised. The long-chain variant is predicted to correlate most straightforwardly with the severity of the deficiencies, due to its effect on energy generation. Medium-/short-chain deficiencies, in contrast, have more complex consequences.
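
The ensemble approach described in the abstract (sample uncertain kinetic parameters, keep only the models that agree with flux data) can be illustrated with a deliberately tiny example. A single Michaelis-Menten reaction stands in for the full pathway, and the log-uniform sampling, parameter range, and flux bounds are all assumptions chosen for the sketch rather than values from the study.

```python
import math
import random
from typing import Dict, List, Tuple


def sample_log_uniform(rng: random.Random, low: float, high: float) -> float:
    """Draw a value uniformly on a log scale between `low` and `high`."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def michaelis_menten(vmax: float, km: float, substrate: float) -> float:
    """Rate of a single enzyme reaction: v = Vmax * S / (Km + S)."""
    return vmax * substrate / (km + substrate)


def build_ensemble(
    n_samples: int,
    km_range: Tuple[float, float],
    vmax: float,
    substrate: float,
    flux_bounds: Tuple[float, float],
    seed: int = 0,
) -> List[Dict[str, float]]:
    """Sample Km values and keep those whose flux lies within `flux_bounds`."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_samples):
        km = sample_log_uniform(rng, *km_range)
        flux = michaelis_menten(vmax, km, substrate)
        if flux_bounds[0] <= flux <= flux_bounds[1]:
            ensemble.append({"km": km, "flux": flux})
    return ensemble


# Km spans four orders of magnitude; the flux bounds are invented.
ensemble = build_ensemble(
    n_samples=1000,
    km_range=(0.01, 100.0),
    vmax=1.0,
    substrate=1.0,
    flux_bounds=(0.4, 0.9),
)
print(f"{len(ensemble)} of 1000 sampled models passed validation")
```

Predictions are then made with every retained model, so each output becomes a range rather than a single number.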

10

LIVIA: a browser-based tool for assessing and visualizing predicted protein interactions

Kim, A.-R.; Perrimon, N.

2026-05-10 bioinformatics 10.64898/2026.05.01.721633 medRxiv

Top 0.2%

8.4%

As protein structure prediction tools become widely adopted across biology, there is a growing need for accessible methods to assess and visualize predicted protein-protein interactions (PPIs). Here we present LIVIA (Local Interaction Visualization and Analysis), a browser-based tool that computes local PPI confidence metrics across multiple prediction platforms, identifies predicted interface residues, embeds an interactive Mol* 3D viewer, and generates visualization scripts for ChimeraX and PyMOL. The tool automatically detects prediction formats; all parsing and computation occur locally on the user's machine. LIVIA is freely available at https://flyark.github.io/LIVIA.

11

ANYI: The ANnotated Yeast Interactome

Nissley, D. A.; Goel, M.; Castellanos-Girouard, X.; Kuntz, C. P.; Wang, Y.; Mukhtar, S.; Serohijos, A.; Schlebach, J. P.

2026-05-05 bioinformatics 10.64898/2026.04.30.721908 medRxiv

Top 0.2%

8.3%

Although several existing protein-protein interaction (PPI) databases provide yeast PPI data, none unify large-scale network topology information with detailed biophysical, proteostasis, and regulatory annotations in a single protein-centric framework. To address this gap, we developed the ANnotated Yeast Interactome (ANYI), an open, integrated resource that combines experimental yeast PPIs with sixteen feature annotation types, including protein abundance, half-life, disorder content, post-translational modifications, conformational stability, chaperone interactions, sequence, and structure. ANYI integrates 3,927 proteins with 155 annotation features, forming a unified matrix that enables systematic cross-layer analyses. Available via GitHub and Docker Hub with an interactive network browser for broad accessibility, ANYI provides both experienced and beginner computational scientists with tools to investigate the yeast interactome. For example, users can directly test whether highly connected hub proteins exhibit distinct stability, disorder, or proteostasis signatures relative to peripheral nodes.
Availability and Implementation: The code used to assemble ANYI is available on GitHub at https://github.com/NCEMS/energetic-origins-of-PPI-connectivity and the database itself and interactive browser tool are available on Docker Hub as dannissleypsu/anyi-browser:v1.0.2.

12

geneSync: Gene Symbol Harmonization for Large-scale RNA-seq Data Integration

Feng, Z.; Li, T.

2026-05-07 bioinformatics 10.64898/2026.05.04.722831 medRxiv

Top 0.3%

8.3%

Cross-cohort integration of transcriptomic data is a routine strategy for boosting statistical power and enhancing generalizability. However, gene nomenclature inconsistencies across datasets--arising from annotation version updates, historical renaming, and synonym reassignment--introduce silent mismatches during feature alignment, causing genes to be falsely classified as absent or split into duplicate features. Here, we present geneSync, an R package that performs gene symbol harmonization as a quality-control (QC) step prior to data integration. geneSync uses a hierarchical matching strategy, prioritizing exact matches to authoritative gene symbols, then exact matches to National Center for Biotechnology Information (NCBI) gene symbols, and finally synonym-based fallback. It includes built-in offline databases for human, mouse, and rat, and supports auditable conflict resolution, cross-species ortholog mapping, and native integration with Seurat and SingleCellExperiment objects. Benchmarking across six mouse hippocampus scRNA-seq datasets spanning 2020-2025 and five CellRanger versions shows that 1.41%-6.22% of features require synonym resolution, and harmonization improves pairwise gene overlap by up to 13.14 percentage points, rescuing 707-1,098 genes per dataset pair. Notably, CellRanger annotation version--rather than data collection year--was identified as the primary driver of nomenclature discrepancy. geneSync is freely available at https://github.com/xiaoqqjun/geneSync.
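
The three-tier matching order described in the abstract (authoritative symbols first, then NCBI symbols, then synonyms) can be sketched in a few lines. geneSync itself is an R package; the Python below is only an illustration, and the function `harmonize_symbol`, the lookup tables, and the gene names are all hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def harmonize_symbol(
    symbol: str,
    authoritative: Set[str],
    ncbi: Set[str],
    synonyms: Dict[str, List[str]],
) -> Tuple[Optional[str], str]:
    """Return (resolved symbol or None, name of the tier that decided)."""
    if symbol in authoritative:  # tier 1: exact authoritative match
        return symbol, "authoritative"
    if symbol in ncbi:  # tier 2: exact NCBI symbol match
        return symbol, "ncbi"
    targets = synonyms.get(symbol, [])  # tier 3: synonym fallback
    if len(targets) == 1:
        return targets[0], "synonym"
    if len(targets) > 1:
        # Several current symbols share this synonym: leave unresolved so
        # the conflict can be audited rather than silently guessed.
        return None, "ambiguous"
    return None, "unmatched"


# Toy lookup tables with made-up gene names.
authoritative = {"GeneA", "GeneB"}
ncbi = {"GeneC"}
synonyms = {"OldA": ["GeneA"], "Shared": ["GeneA", "GeneB"]}

for query in ["GeneA", "GeneC", "OldA", "Shared", "Nope"]:
    print(query, harmonize_symbol(query, authoritative, ncbi, synonyms))
```

Returning the tier alongside the resolved symbol keeps each decision traceable, which fits the abstract's emphasis on auditable conflict resolution.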

13

Disease-guided functional gene mapping across species reveals translational correspondences beyond sequence orthology

Yan, J.; Cao, Z.

2026-05-13 bioinformatics 10.64898/2026.05.10.720506 medRxiv

Top 0.3%

8.2%

Selecting the correct mouse gene to model a human disease phenotype is critical for translational research, yet sequence-based orthology fails when genes have been lost, duplicated, or functionally rewired between species. Here we present BRIDGE (Biological Rank Integration for Disease Gene Equivalence), a framework that identifies functional mouse equivalents of human disease genes without sequence input. BRIDGE integrates 3.37 million disease-gene associations, biological pathways, and Gene Ontology annotations into a unified heterogeneous graph (94,897 nodes, ~8.3 million edges), encoded by a heterogeneous graph transformer with fused Gromov-Wasserstein alignment and multi-strategy reciprocal rank fusion. On two sequence-independent benchmarks, BRIDGE achieves Recall@5 of 61.8-66.7%, compared with 0.0-20.1% for Ensembl Compara. We validate BRIDGE through case studies including neutrophil pathway rewiring (CXCL8→Cxcl1/2/5), acute-phase divergence (CRP→Apcs), and immune checkpoint substitution (LILRB2→Pirb), and demonstrate complementarity with sequence methods in drug-translation analysis. Prospective validation of 30 novel predictions against three independent data modalities (tissue expression, cell-type expression, and phenotype concordance) shows that BRIDGE picks are favoured in 64 of 65 orthogonal tests (sign test P = 3.6 × 10^-18) and significantly outperform all tested baselines including Ensembl Compara, BLAST RBH, and ESM-2. BRIDGE provides a benchmarked framework for functional cross-species gene mapping in disease-model design.
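
Reciprocal rank fusion, which the abstract names as the step that combines several ranking strategies, is commonly computed by summing 1 / (k + rank) over the input rankings. The sketch below uses the conventional k = 60; the value used in BRIDGE is not stated in the abstract, and the candidate names and rankings are invented.

```python
from collections import defaultdict
from typing import DefaultDict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[str]:
    """Fuse ranked lists: score(item) = sum over lists of 1 / (k + rank)."""
    scores: DefaultDict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Three hypothetical strategies ranking candidate mouse genes m1..m3.
toy_rankings = [
    ["m1", "m2", "m3"],
    ["m2", "m1", "m3"],
    ["m2", "m3", "m1"],
]

fused = reciprocal_rank_fusion(toy_rankings)
print(fused)
```

Because only ranks enter the score, strategies whose raw scores live on different scales can be combined without any calibration.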

14

ProtmRNA: Cross-Modal Knowledge Transfer from Proteins to Messenger RNA

Xu, G.; Wu, X.; Ma, J.

2026-05-19 bioinformatics 10.64898/2026.05.19.726141 medRxiv

Top 0.3%

7.4%

Motivation: According to the central dogma of molecular biology, messenger RNA (mRNA) sequences are directly translated into amino acid sequences, positioning mRNA as the fundamental intermediary between genetic information and functional proteins. This natural correspondence suggests that mRNA sequence analysis could greatly benefit from the rich evolutionary and functional representations learned by large-scale protein language models.
Results: ProtmRNA repurposes the pre-trained ESM-2 protein language model for mRNA sequence processing via cross-modal transfer learning. Evaluated on mRNA- and protein-related datasets, along with eight additional benchmarks compiled in this study, ProtmRNA achieves performance comparable or superior to state-of-the-art mRNA language models while using less than half the pre-training computational resources. This work establishes the potential of cross-modal transfer learning between biological sequences by demonstrating that protein-derived knowledge can be efficiently transferred to mRNA, offering a resource-efficient paradigm for advancing mRNA sequence understanding.
Availability and Implementation: The pre-trained ProtmRNA model and the eight CDS-region regression benchmarks curated in this study are publicly available at https://github.com/pesenteur/ProtmRNA.

15

CN-RNN: a Deep Learning Framework for Copy Number Variation Detection with Exome Sequencing Data

Wang, D.; Qin, F.; Bao, W.; Bacher, R.; Chung, D.; Lu, Q.; Efron, P. A.; Cai, G.; Xiao, F.

2026-05-15 genetics 10.64898/2026.05.13.724920 medRxiv

Top 0.3%

7.0%

Copy number variations (CNVs) are major structural genomic variants that contribute to a wide range of human diseases. Accurate detection of CNVs from whole-exome sequencing (WES) data has been a long-sought goal for clinical and population genetic studies. Despite recent progress, existing WES-based CNV callers still suffer from high false-positive rates and reduced recall for short-length variants, and current deep learning methods have not fully used complementary information in region-level genomic features. Here we present CN-RNN, a deep learning-based CNV caller for WES data. The model combines a bidirectional long short-term memory (BiLSTM) branch that captures local depth changes and contextual dependencies across neighboring exons with a parallel multi-layer perceptron (MLP) branch that encodes region-level metadata such as GC content, mappability, and exon length. CN-RNN was trained on the Autism Sequencing Consortium (ASC) parent-child trio cohort using the Mendelian rule of inheritance to ensure high-quality training sets. It was evaluated across three independent datasets, in which we showed that CN-RNN outperformed existing WES-based CNV callers and deep learning methods. CN-RNN offers a scalable, accurate tool for CNV profiling in WES-based studies and supports broader application of CNV analysis in population and clinical research. CN-RNN is available at https://github.com/FeifeiXiao-lab/CN-RNN.

16

Cosine Similarity Conflates Clinically Distinct Cancer Variants: A Case for Typed-Graph Retrieval in Precision Oncology Decision Support

Khan, U. A.

2026-05-11 bioinformatics 10.64898/2026.05.05.723102 medRxiv

Top 0.3%

6.9%

Retrieval-augmented generation (RAG) is increasingly applied to clinical decision support in oncology, where treatment selection depends on identifying a patient's specific somatic variant from an NGS report and matching it to evidence-graded therapy options. The vector retrieval that underlies most RAG systems uses cosine similarity over text embeddings, an architecture optimized for linguistic proximity rather than entity-level identity. We hypothesize that cosine-similarity-based retrieval conflates clinically distinct cancer variants at clinically relevant rates, while a typed-graph approach in which each variant is a discrete node preserves variant-level identity by construction. We evaluated 9 cancer variant pairs known to have differential FDA-approved therapy indications, with variant identity informed by the CIViC clinical variant evidence database and primary clinical literature. Variant pairs included BRAF V600E vs V600K (melanoma), EGFR L858R vs T790M (NSCLC, the canonical sensitivity-vs-resistance pair), EGFR exon 19 deletion vs L858R, KRAS G12C vs G12D (only G12C has FDA-approved targeted therapy), KRAS G12C vs G12V, ERBB2 amplification vs activating mutation, two PIK3CA hotspot pairs, and NTRK1 fusion vs point mutation. We computed pairwise cosine similarity for each variant's text representation across three open-source embedding models (PubMedBERT, MedCPT, BGE-large-en-v1.5) and three text formats (short, medium, long). Across the medium format (gene + variant + tumor type), 100% of clinically distinct variant pairs (9/9) had cosine similarity ≥ 0.95 under both biomedical encoders (PubMedBERT, MedCPT). The general-purpose encoder (BGE-large-en-v1.5) showed lower conflation in the medium format (11%) but rose to 100% with added clinical context. At the more stringent τ = 0.99 (averaged across formats), PubMedBERT conflated 56% of pairs and MedCPT conflated 22%.
The biomedically pre-trained encoders performed worse, not better, than the general-purpose encoder. The typed-graph baseline achieves zero conflation by construction. We discuss the architectural implications: vector retrieval is appropriate for unstructured literature search but introduces unsafe ambiguity when used as the substrate for variant-level reasoning that drives drug-selection decisions. We argue that typed-graph retrieval should be the default architecture for any retrieval-grounded clinical decision support system that recommends targeted therapy.
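The conflation measure described in this abstract can be illustrated with a minimal sketch. It assumes that "conflation" means the fraction of variant pairs whose embedding cosine similarity is at or above a threshold τ; the function names, the toy vectors, and the node identifiers are hypothetical and are not taken from the preprint.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def conflation_rate(pairs, tau=0.95):
    """Fraction of vector pairs with cosine similarity >= tau (assumed definition)."""
    conflated = sum(1 for u, v in pairs if cosine_similarity(u, v) >= tau)
    return conflated / len(pairs)

# Toy, made-up vectors standing in for text embeddings (not real model output).
toy_pairs = [
    ([1.0, 0.0, 0.0], [0.99, 0.05, 0.0]),  # almost the same direction
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),    # orthogonal
]
rate = conflation_rate(toy_pairs, tau=0.95)

# In a typed graph, each variant is a discrete key, so lookup is by exact
# identity rather than by proximity. The node ids here are placeholders.
variant_nodes = {("BRAF", "V600E"): "node-1", ("BRAF", "V600K"): "node-2"}
distinct = variant_nodes[("BRAF", "V600E")] != variant_nodes[("BRAF", "V600K")]
```

With real embeddings, the vectors would come from an encoder applied to each variant's text representation; the exact-key lookup shows why a typed graph cannot confuse two variants whose strings differ by a single character.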

17

MIMOSA: A model-independent framework for transcription factor binding site motif similarity assessment

Tsukanov, A. V.; Levitsky, V. G.

2026-05-17 bioinformatics 10.64898/2026.05.13.725009 medRxiv

Top 0.4%

6.7%

Motivation: Transcription factors (TFs) regulate gene expression by binding specific DNA sequences, which are commonly represented by motif models. Although position weight matrices (PWMs) remain the dominant motif representation, alternative models, such as Markov models, can capture interpositional dependencies and may provide higher predictive performance. However, existing motif comparison tools are designed mainly for PWMs or require motifs to be reduced to PWM/PPM representations. This creates a major bottleneck for comparing motifs represented by different model architectures. This limitation complicates the interpretation of de novo motif discovery results and hinders the systematic integration of diverse motif models into genomic analyses. Results: We present MIMOSA (Model-Independent Motif Similarity Assessment), a model-independent framework for direct comparison of TF binding site (TFBS) motifs regardless of their mathematical representation. MIMOSA assesses motif similarity by comparing calibrated recognition profiles produced by motifs of different models on the same DNA sequence set, rather than by comparing the motifs themselves. In a cross-database benchmark on HOCOMOCO motifs, MIMOSA achieved retrieval performance comparable to established PWM-oriented tools, including Tomtom and MACRO-APE, with MRR and Recall@k close to the best-performing methods. Pairwise ranking comparisons showed that MIMOSA captures a similarity signal consistent with existing approaches while providing a representation-independent comparison strategy. Application to de novo motifs derived from ChIP-seq data for the ATF3 TF demonstrated that recognition-profile comparison distinguished alternative spacer variants represented as separate PWMs from their integration within more flexible models such as BaMM and Slim. Thus, MIMOSA enables quantitative cross-model motif comparison and supports interpretation of motif heterogeneity in TFBS analyses.
Availability and implementation: MIMOSA is implemented in Python and is freely available at https://github.com/ubercomrade/mimosa.
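For readers unfamiliar with the retrieval metrics named in this abstract, the following is a minimal sketch of the standard definitions of mean reciprocal rank (MRR) and Recall@k. It is not MIMOSA's evaluation code; the function names and the toy motif identifiers are hypothetical.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item in the ranked list, or 0.0 if none."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked_list, relevant_set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top k results."""
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)

# Toy queries with placeholder motif ids: each has a ranked result list
# and the set of motifs considered correct matches.
toy_queries = [
    (["m1", "m2", "m3"], {"m2"}),  # first hit at rank 2
    (["m7", "m8", "m9"], {"m7"}),  # first hit at rank 1
]
mrr = mean_reciprocal_rank(toy_queries)
r_at_2 = recall_at_k(["m1", "m2", "m3"], {"m2", "m3"}, k=2)
```

In a motif-retrieval benchmark, each query motif would be compared against a database and the ranked list ordered by the tool's similarity score.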

18

Nutritional-Metabolic Lipid Profiling with LipidOne for plasma lipidomics interpretation in metabolic health

Frongia Mancini, D.; Alabed, H. B. R.; Pellegrino, R. M.

2026-05-18 bioinformatics 10.64898/2026.05.14.725104 medRxiv

Top 0.4%

6.5%

Background/Objectives: Human plasma lipidomics provides valuable information on dietary and metabolic phenotypes, but the interpretation of high-dimensional lipid datasets remains challenging. We developed the Nutritional-Metabolic Lipid Profile (NMLP) module within LipidOne to translate plasma lipidomics data into interpretable nutritional-metabolic indices, functional categories, visual outputs, and biological statements. Subjects/Methods: NMLP calculates lipid indices reflecting cardiometabolic lipid status, fatty acid remodelling, overall lipid quality, oxidative protection, and omega-3/essential fatty acid status. The module was applied to three public human plasma lipidomics datasets: a randomized crossover glycemic-load feeding study, a eucaloric high-fat diet intervention in normal-weight women, and a large public dataset stratified by insulin sensitivity. Results: Across datasets, NMLP converted complex lipidomic matrices into coherent nutritional-metabolic profiles. In the glycemic-load study, the module highlighted metabolic lipid shifts not captured by standard clinical lipid panels, mainly involving cardiometabolic lipid status, oxidative protection, and fatty acid remodelling. In the high-fat diet intervention, NMLP tracked temporal lipid remodelling across pre-diet, on-diet, and post-diet states, consistent with metabolic adaptation to increased dietary fat exposure. In the insulin-sensitivity dataset, insulin-resistant subjects showed a storage-oriented lipid phenotype characterized by increased neutral lipid storage indices and altered lipid quality and oxidative-protection features. Category-level clustering further revealed heterogeneous nutritional-metabolic states within insulin-resistant subjects. Conclusions: NMLP provides a deeper and clearer interpretative framework for human plasma lipidomics in nutrition and metabolic health research.
By translating lipid species into functional indices and category-level readouts, the module may facilitate the use of lipidomics in clinical nutrition, metabolic phenotyping, and precision nutrition studies. NMLP is freely accessible as part of the online LipidOne platform.

19

On the state of protein function prediction: a report on the fourth CAFA challenge

Ramola, R.; De Paolis Klauza, M. C.; Piovesan, D.; Peng, Y.; Joshi, P.; Mehdiabadi, M.; Quaglia, F.; Pancsa, R.; Chemes, L. B.; Ahmadi, M.; Ahn, H.; Altenhoff, A. M.; Asgari, E.; Aspromonte, M. C.; Atalay, V.; Babbi, G.; Baldazzi, D.; Barot, M. M.; Ben-Hur, A.; Benso, A.; Berenberg, D.; Bjorne, J.; Boecker, F.; Boldi, P.; Bonello, J.; Bordin, N.; Borole, P.; Ebrahimpour Boroojeny, A.; Cao, R.; Di Carlo, S.; Casadio, R.; Casiraghi, E.; Chang, J.-M.; Chen, C.; Chen, T.-M.; Cheng, J.; Chiu, S.; Dalkiran, A.; Davidovic, R. S.; Dessimoz, C.; Diao, R.; Djeddi, W. E.; Dogan, T.; Flannery, S. T.; Font

2026-05-11 bioinformatics 10.64898/2026.05.06.722942 medRxiv

Top 0.4%

6.4%

Background: The Critical Assessment of Functional Annotation (CAFA) is a community effort held to understand the field of computational protein function prediction. Every three years, since 2010, the organizers initiate an experiment to collect function predictions on a large set of proteins and then evaluate the performance of predicting methods on a subset of proteins that have accumulated experimental annotations between the submission deadline and the evaluation time. CAFA provides an independent and rigorous assessment of the current state of the art, thus leveling the playing field, highlighting successes, revealing bottlenecks, and offering a forum for the exchange of ideas in protein science. Here, we report the results of the fourth CAFA experiment (CAFA4). Results: CAFA4 featured the participation of 148 methods from 70 research groups on a total of 46,205 unique proteins over a 5-year annotation accumulation phase, the longest in any CAFA. In a comparison across CAFA2-CAFA4 methods, the prediction of Gene Ontology (GO) terms has clearly improved across all three GO aspects and traditional evaluation settings. While not achieving the first rank, several CAFA2 and CAFA3 methods featured in the top ten methods in many evaluations, suggesting that earlier methods still hold relevance. The performance is weaker in the newly introduced "partial knowledge" evaluation category (proteins with experimental annotations before submission deadline that gained additional annotations in the same GO aspect during the annotation accumulation phase), highlighting the need for a new class of methods. The rankings of the methods were stable over the years in traditional evaluation settings, but less so in the new partial knowledge evaluation. Overall, the field continues to progress with some influx of new participants. Sustained efforts will be necessary to substantially advance it.

20

An generative-AI framework for target-Specific MicroRNAs towards RNAi-based drug design

Gu, J.; Li, Y.

2026-05-11 genomics 10.64898/2026.05.07.723585 medRxiv

Top 0.4%

6.4%

MicroRNAs (miRNAs) are small non-coding RNAs that regulate gene expression by binding to target messenger RNAs (mRNAs), and their versatility has inspired RNA-interference (RNAi)-based drug designs. However, off-target effects lead to unintended gene silencing and toxicity. Existing methods suffer from experimental data scarcity and fail to effectively integrate target specificity into designing de novo small interfering RNAs (siRNAs). To overcome these challenges, we present SpeciMiR, a specificity-guided generative framework that synthesizes target-conditioned miRNAs. By training on a large experimental dataset containing 2.2M miRNA-mRNA pairs, SpeciMiR minimizes off-target effects with enhanced on-target potency. As a result, SpeciMiR-generated miRNAs bind more strongly to the target mRNAs than the observed miRNAs and much less so to off-target mRNAs. We tested SpeciMiR on mRNA targets for liver disease, for which 6 FDA-approved siRNA-based drugs were available. SpeciMiR recovers binding regions that correspond to FDA-approved siRNA drugs across 3 targets, and demonstrates greater structural specificity for on-target mRNAs than for off-target mRNAs. Together, SpeciMiR offers an AI solution to synthesize miRNA-inspired and target-specific siRNA sequences towards RNAi-based drug design.